Improved consensus clustering via linear programming

Authors

  • Nicholas Downing
  • Peter J. Stuckey
  • Anthony Wirth

Abstract

We consider the problem of Consensus Clustering. Given a finite set of input clusterings over some data items, a consensus clustering is a partitioning of the items which matches as closely as possible the given input clusterings. The best exact approach to tackling this problem is by modelling it as a Boolean Integer Program (BIP). Unfortunately, the size of the BIP grows cubically in the number of data items, hence this method is applicable to only small sets of items. In this paper we show how to tackle the problem progressively, leading to much improved solution times and far less memory usage than previously. For the case where approximate clusterings are acceptable, we show a number of heuristic techniques for extracting good clusterings from the solutions of the linear relaxation of the BIP, and on several very large data sets we demonstrate much higher quality approximations than previously possible. When optimal solutions are desired, the problem is much harder, and we present some novel and existing techniques that can assist in finding candidate answers and proving the optimality thereof. For the first time we present optimal Consensus Clusterings for several complete, albeit small, data sets.
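
The abstract does not spell out the objective or the BIP itself, so the following is only a minimal sketch under stated assumptions. It assumes "matches as closely as possible" means minimising the total number of item pairs on which the consensus and an input clustering disagree (one joins the pair, the other separates it). It also assumes the common formulation with one 0-1 variable per item pair and three transitivity constraints per item triple, which is one way the cubic growth mentioned above can arise. All function names are illustrative and do not come from the paper; the exhaustive search is not the authors' method and is only feasible for a handful of items.

```python
from itertools import combinations


def pair_distance(a, b):
    """Count item pairs that one clustering joins and the other separates.

    Each clustering is a list of cluster labels, one label per item.
    """
    n = len(a)
    return sum(
        (a[i] == a[j]) != (b[i] == b[j])
        for i, j in combinations(range(n), 2)
    )


def partitions(n):
    """Yield every partition of n items as a list of cluster labels.

    Labels follow the restricted-growth convention, so each partition
    is produced exactly once.
    """
    def extend(labels, used):
        if len(labels) == n:
            yield list(labels)
            return
        for lab in range(used + 1):
            labels.append(lab)
            yield from extend(labels, max(used, lab + 1))
            labels.pop()

    yield from extend([], 0)


def brute_force_consensus(inputs):
    """Return (best_partition, cost) by exhaustive search (tiny inputs only)."""
    n = len(inputs[0])

    def total_cost(p):
        return sum(pair_distance(p, c) for c in inputs)

    best = min(partitions(n), key=total_cost)
    return best, total_cost(best)


def triangle_constraint_count(n):
    """Transitivity constraints in the assumed pairwise 0-1 model: 3 per triple."""
    return 3 * (n * (n - 1) * (n - 2) // 6)


if __name__ == "__main__":
    inputs = [[0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 0, 1]]
    print(brute_force_consensus(inputs))
    print(triangle_constraint_count(100))
```

Under these assumptions, the constraint count for 100 items is already 485,100, which illustrates why a formulation of this shape becomes impractical as the number of items grows.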

Similar resources

Entropy-based Consensus for Distributed Data Clustering

The increasingly larger scale of available data and the more restrictive concerns on their privacy are some of the challenging aspects of data mining today. In this paper, Entropy-based Consensus on Cluster Centers (EC3) is introduced for clustering in distributed systems with a consideration for confidentiality of data; i.e. it is the negotiations among local cluster centers that are used in t...

A Consensus Embedding Approach for Segmentation of High Resolution In Vivo Prostate Magnetic Resonance Imagery

Current techniques for localization of prostatic adenocarcinoma (CaP) via blinded trans-rectal ultrasound biopsy are associated with a high false negative detection rate. While high resolution endorectal in vivo Magnetic Resonance (MR) prostate imaging has been shown to have improved contrast and resolution for CaP detection over ultrasound, similarity in intensity characteristics between benig...

On aggregating binary relations using 0-1 integer linear programming

This paper is concerned with the general problem of aggregating many binary relations in order to find a consensus. The theoretical background we rely on is the Relational Analysis (RA) approach. This method represents binary relations (BRs) as adjacency matrices, models relational properties as linear equations and finds a consensus by maximizing a majority-based criterion using 0-1 i...

CONCORD: a consensus method for protein secondary structure prediction via mixed integer linear optimization

Most of the protein structure prediction methods use a multi-step process, which often includes secondary structure prediction, contact prediction, fragment generation, clustering, etc. For many years, secondary structure prediction has been the workhorse for numerous methods aimed at predicting protein structure and function. This paper presents a new mixed integer linear optimization (MILP)-b...

Selecting ensemble members in ensemble clustering using voting

Clustering is the process of dividing a dataset into subsets called clusters, so that objects within a cluster are similar to each other and different from objects in other clusters. Many algorithms, following different approaches, have been proposed for clustering. An effective option is to combine two or more of these algorithms to solve the clustering problem. Ensemb...


Publication date: 2010